Hotel Booking Analysis

The purpose of this analysis is to select meaningful predictors from the dataset and choose the best model for predicting booking cancellations. The dataset contains data from two hotels, with 119,390 rows and 32 columns, and was downloaded from https://www.kaggle.com/jessemostipak/hotel-booking-demand

Content:

1. EDA

The dataset contains data from two hotels, Resort Hotel and City Hotel. I will conduct exploratory data analysis by answering the following questions.

  • Where do guests come from?
  • Where do the bookings come from?
  • What is the ADR (average daily rate) by booking channel?
  • What is the ADR per month for each hotel?
  • What is the customer type for each hotel?
  • How long do guests stay on average at each hotel?
  • Do guests tend to cancel their bookings if they booked earlier?
  • Which market segments have higher cancellation rates?
  • Do guests who paid no deposit tend to cancel?
  • Do guests tend to cancel if they cancelled before?

2. Evaluate Feature Importance

Excluding the target variable, we have 31 features in this dataset. In this section, we will evaluate which features play the most important role in predicting cancellations.

3. Predict Cancellation

Cancellation prediction plays a crucial role in forecasting demand and improving sell-out efficiency, thus maximizing hotel revenue. In this section, we will find the model with the highest prediction accuracy.

Now, let's start the journey!

- Overview the data

In [34]:
# Import some libraries
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import plotly.offline as pyo
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from scipy import stats
%matplotlib inline

import plotly.io as pio
pio.renderers.default='notebook'
In [35]:
# import data
htl_data_raw = pd.read_csv("hotel_bookings.csv")
# Display the first 5 rows of our data set
htl_data_raw.head()
Out[35]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 rows × 32 columns

In [36]:
# Display the shape of dataset
htl_data_raw.shape
Out[36]:
(119390, 32)
In [37]:
# copy the raw dataset 
htl_data = htl_data_raw.copy()
In [38]:
# Display data type of each attribute
htl_data.dtypes
Out[38]:
hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                          float64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             float64
company                           float64
days_in_waiting_list                int64
customer_type                      object
adr                               float64
required_car_parking_spaces         int64
total_of_special_requests           int64
reservation_status                 object
reservation_status_date            object
dtype: object
In [39]:
# A general overview of the data
htl_data.describe(include = "all").T
Out[39]:
count unique top freq mean std min 25% 50% 75% max
hotel 119390 2 City Hotel 79330 NaN NaN NaN NaN NaN NaN NaN
is_canceled 119390 NaN NaN NaN 0.370416 0.482918 0 0 0 1 1
lead_time 119390 NaN NaN NaN 104.011 106.863 0 18 69 160 737
arrival_date_year 119390 NaN NaN NaN 2016.16 0.707476 2015 2016 2016 2017 2017
arrival_date_month 119390 12 August 13877 NaN NaN NaN NaN NaN NaN NaN
arrival_date_week_number 119390 NaN NaN NaN 27.1652 13.6051 1 16 28 38 53
arrival_date_day_of_month 119390 NaN NaN NaN 15.7982 8.78083 1 8 16 23 31
stays_in_weekend_nights 119390 NaN NaN NaN 0.927599 0.998613 0 0 1 2 19
stays_in_week_nights 119390 NaN NaN NaN 2.5003 1.90829 0 1 2 3 50
adults 119390 NaN NaN NaN 1.8564 0.579261 0 2 2 2 55
children 119386 NaN NaN NaN 0.10389 0.398561 0 0 0 0 10
babies 119390 NaN NaN NaN 0.00794874 0.0974362 0 0 0 0 10
meal 119390 5 BB 92310 NaN NaN NaN NaN NaN NaN NaN
country 118902 177 PRT 48590 NaN NaN NaN NaN NaN NaN NaN
market_segment 119390 8 Online TA 56477 NaN NaN NaN NaN NaN NaN NaN
distribution_channel 119390 5 TA/TO 97870 NaN NaN NaN NaN NaN NaN NaN
is_repeated_guest 119390 NaN NaN NaN 0.0319122 0.175767 0 0 0 0 1
previous_cancellations 119390 NaN NaN NaN 0.0871178 0.844336 0 0 0 0 26
previous_bookings_not_canceled 119390 NaN NaN NaN 0.137097 1.49744 0 0 0 0 72
reserved_room_type 119390 10 A 85994 NaN NaN NaN NaN NaN NaN NaN
assigned_room_type 119390 12 A 74053 NaN NaN NaN NaN NaN NaN NaN
booking_changes 119390 NaN NaN NaN 0.221124 0.652306 0 0 0 0 21
deposit_type 119390 3 No Deposit 104641 NaN NaN NaN NaN NaN NaN NaN
agent 103050 NaN NaN NaN 86.6934 110.775 1 9 14 229 535
company 6797 NaN NaN NaN 189.267 131.655 6 62 179 270 543
days_in_waiting_list 119390 NaN NaN NaN 2.32115 17.5947 0 0 0 0 391
customer_type 119390 4 Transient 89613 NaN NaN NaN NaN NaN NaN NaN
adr 119390 NaN NaN NaN 101.831 50.5358 -6.38 69.29 94.575 126 5400
required_car_parking_spaces 119390 NaN NaN NaN 0.0625178 0.245291 0 0 0 0 8
total_of_special_requests 119390 NaN NaN NaN 0.571363 0.792798 0 0 0 1 5
reservation_status 119390 3 Check-Out 75166 NaN NaN NaN NaN NaN NaN NaN
reservation_status_date 119390 926 2015-10-21 1461 NaN NaN NaN NaN NaN NaN NaN

- Data Cleaning

In [40]:
# check null values for each attribute
htl_data.isnull().sum()
Out[40]:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64
In [41]:
# check missing values percentage for company, agent and country columns
def pct_col(x):
    perc = round(x.isnull().sum() / len(x) * 100,2)
    return perc

print('Missing value ratios:\nCompany: {}\nAgent: {}\nCountry: {}'.format(pct_col(htl_data['company']),
                                                                                   pct_col(htl_data['agent']),
                                                                                   pct_col(htl_data['country'])))
Missing value ratios:
Company: 94.31
Agent: 13.69
Country: 0.41

We notice that the company column has a lot of missing values. It would be difficult, and of little value for prediction, to fill in all the rows of the company column, so we will remove it from the dataset. Agent has about 13.7% missing values and country has 0.41%; I will keep those two columns for now and examine their correlations later.
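The per-column helper above generalizes to the whole frame in one call: the mean of the boolean `isnull()` mask is the missing-value fraction per column. A minimal sketch on an invented toy frame (the rows are illustrative, not from the dataset):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for htl_data (only three of its columns, invented rows)
df = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 10.0],
    "agent":   [1.0, np.nan, 3.0, 4.0],
    "country": ["PRT", "GBR", None, "ESP"],
})

# mean() of the boolean isnull() mask = fraction missing, for every column at once
missing_pct = (df.isnull().mean() * 100).round(2)
print(missing_pct)
```

On the full dataset, `(htl_data.isnull().mean() * 100).round(2)` reports the same three percentages shown above, plus every other column in a single pass.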

In [42]:
# remove company column
htl_data.drop(columns = ['company'], inplace = True)
In [43]:
# check the distribution of the children column
htl_data['children'].hist()
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x121d6cfa0>

Next, I will clean up the 'children' column. Since there are only 4 missing values and most guests stay at the hotel without children, I will fill the missing values with 0.

In [44]:
# fill the missing value to 0
htl_data['children'].fillna(0,inplace = True)
In [45]:
# check the null value 
htl_data.isnull().sum()
Out[45]:
hotel                                 0
is_canceled                           0
lead_time                             0
arrival_date_year                     0
arrival_date_month                    0
arrival_date_week_number              0
arrival_date_day_of_month             0
stays_in_weekend_nights               0
stays_in_week_nights                  0
adults                                0
children                              0
babies                                0
meal                                  0
country                             488
market_segment                        0
distribution_channel                  0
is_repeated_guest                     0
previous_cancellations                0
previous_bookings_not_canceled        0
reserved_room_type                    0
assigned_room_type                    0
booking_changes                       0
deposit_type                          0
agent                             16340
days_in_waiting_list                  0
customer_type                         0
adr                                   0
required_car_parking_spaces           0
total_of_special_requests             0
reservation_status                    0
reservation_status_date               0
dtype: int64
In [46]:
htl_data.shape
Out[46]:
(119390, 31)

From the general overview above, some other columns also catch our eye, such as bookings with zero adults, ten children, or a negative ADR. Next, I will clean up these abnormal values.

In [47]:
# Count how many rows have zero adults
htl_data[htl_data['adults'] == 0].count()['adults']
Out[47]:
403
In [48]:
# check unique value for adults
htl_data['adults'].unique()
Out[48]:
array([ 2,  1,  3,  4, 40, 26, 50, 27, 55,  0, 20,  6,  5, 10])
In [49]:
# see how many entries with adults greater than 4
htl_data[htl_data['adults'] > 4].count()['adults']
Out[49]:
16
In [50]:
# set adult values of 0 or greater than 4 to 1
htl_data.loc[(htl_data['adults'] > 4) | (htl_data['adults'] == 0), 'adults'] = 1
In [51]:
htl_data['adults'].unique()
Out[51]:
array([2, 1, 3, 4])
In [52]:
# list unique values in children
htl_data["children"].unique()  
Out[52]:
array([ 0.,  1.,  2., 10.,  3.])
In [53]:
# count how many rows have children equal to 10
htl_data[htl_data['children'] == 10].count()['children']
Out[53]:
1
In [54]:
# change children value from 10 to 0
htl_data.loc[htl_data['children'] == 10, 'children'] = 0
In [55]:
# re-check the values in the children column
htl_data["children"].unique() 
Out[55]:
array([0., 1., 2., 3.])
In [56]:
# count how many rows have an ADR below zero
htl_data[htl_data['adr'] < 0 ].count()['adr']
Out[56]:
1
In [57]:
# locate reservation with negative adr
htl_data.loc[htl_data['adr']<0]
Out[57]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... booking_changes deposit_type agent days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
14969 Resort Hotel 0 195 2017 March 10 5 4 6 2 ... 2 No Deposit 273.0 0 Transient-Party -6.38 0 0 Check-Out 2017-03-15

1 rows × 31 columns

The negative ADR may be due to a mistake by the agent or a rebate issued after a guest complaint. Since it is just one reservation, I will leave it as-is for now.

- Exploratory Data Analysis

In [58]:
# See the distribution of bookings between the two hotels
htl_data['hotel'].value_counts().plot(kind = 'bar');
In [59]:
htl_data['hotel'].value_counts(normalize = True)
Out[59]:
City Hotel      0.664461
Resort Hotel    0.335539
Name: hotel, dtype: float64

In the dataset, 66% of bookings are for the City Hotel and 34% for the Resort Hotel. Since city hotels and resort hotels usually have different booking patterns, I will analyze them separately. From the general overview, we know that the dataset covers bookings from 2015 to 2017. I am curious which months the bookings fall in, since seasonality is a crucial driver of hotel demand.

In [60]:
# graph booking arrival month for each year from 2015 to 2017
# define a function filter_year_graph to input different years to plot
def filter_year_graph(year):
    plt.figure(figsize=(15,5))
    filtered_year = htl_data.query('arrival_date_year == @year')
    sns.countplot(data = filtered_year, x = 'arrival_date_month', 
              hue = 'hotel', order = ['January', 'February', 'March', 'April', 'May',
                                      'June', 'July', 'August', 'September', 'October', 'November',
                                      'December']).set_title('The number of arrivals by month of ' + str(year), fontsize = 20)
    plt.xlabel('Month') 
    plt.ylabel('Count')
filter_year_graph(2015)
filter_year_graph(2016)
filter_year_graph(2017)

The bar charts above show that we have three consecutive years of data for July and August, and two consecutive years for the other ten months. Next, we will do exploratory analysis to answer some interesting questions.

First, I want to see what share of bookings materialized.

In [61]:
htl_data['is_canceled'].value_counts(normalize = True)
Out[61]:
0    0.629584
1    0.370416
Name: is_canceled, dtype: float64

37% of bookings across the two hotels were eventually cancelled. For questions 1-6, I will specifically analyze the bookings that materialized.

In [62]:
# select bookings that are not canceled
data_not_canceled = htl_data[htl_data['is_canceled'] == 0]

Since the resort hotel and city hotel usually have very different markets and customers, I will split the data and analyze each hotel separately.

In [63]:
# select bookings from resort hotel
data_resort = data_not_canceled[data_not_canceled['hotel'] == 'Resort Hotel']
# select bookings from city hotel
data_city = data_not_canceled[data_not_canceled['hotel'] == 'City Hotel']

1. Where do the guests come from?

In [64]:
# group by country and count for resort hotel
resort_country_count = data_resort.groupby('country').count()
r1 = resort_country_count.filter(['hotel'])
r2 = r1.sort_values('hotel', ascending=False)
resort_country_count = r2.reset_index()
resort_country_label = resort_country_count['country']
resort_size = resort_country_count['hotel']

# group by country and count for city hotel
city_country_count = data_city.groupby('country').count()
c1 = city_country_count.filter(['hotel'])
c2 = c1.sort_values('hotel', ascending=False)
city_country_count = c2.reset_index()
city_country_label = city_country_count['country']
city_size = city_country_count['hotel']
In [68]:
# pie chart for resort hotel to see where the guests come from
pyo.init_notebook_mode(connected=True)
fig = px.pie(resort_country_count, values = "hotel", names = "country", 
             title = "Home Countries of Guests - Resort Hotel", width = 500)
fig.update_traces(textposition="inside", textinfo="value+percent+label")

fig.show('notebook')
In [67]:
# pie chart for city hotel to see the guests origin 
fig = px.pie(city_country_count, values = "hotel", names = "country", 
             title = "Home Countries of Guests - City Hotel", width = 500)
fig.update_traces(textposition="inside", textinfo="value+percent+label")
fig.show()

The top three countries of origin for guests at the resort hotel are PRT, GBR and ESP, which together account for 67.4% of all bookings. The top three for the city hotel are PRT, FRA and DEU. We can use this information to target specific markets when launching marketing campaigns.

2. Where do the bookings come from?

In [ ]:
# group by and count market segment for resort hotel
resort_market_segment = data_resort.groupby('market_segment').count()
m1 = resort_market_segment.filter(['hotel'])
resort_market_segment = m1.reset_index()
resort_market_segment = resort_market_segment.sort_values(['hotel'],ascending = False)

# group by and count market segment for city hotel
city_market_segment = data_city.groupby('market_segment').count()
m2 = city_market_segment.filter(['hotel'])
city_market_segment = m2.reset_index()
city_market_segment = city_market_segment.sort_values(['hotel'], ascending = False)
In [ ]:
size_resort = resort_market_segment['hotel']
label_resort = resort_market_segment['market_segment']
size_city = city_market_segment['hotel']
label_city = city_market_segment['market_segment']
In [ ]:
plt.rcParams["font.size"] = 10

plt.subplot(1,5,1)
plt.pie(size_resort, labels = label_resort, autopct = '%1.1f%%',radius = 4.0)
plt.title("Market Segment - Resort Hotel", pad = 120)
plt.subplot(1,5,5)
plt.pie(size_city, labels = label_city, autopct = '%1.1f%%',radius = 4.0)
plt.title("Market Segment - City Hotel", pad = 120)

plt.show()

It is not surprising that more than 50% of bookings at both hotels come from online or offline travel agents. Since we pay commission to travel agents, we would like to shift bookings from OTAs to the direct channel. In particular, if the OTA share grows relative to direct bookings compared to last month, we should be alerted: check our rates on the brand website and on the OTAs, and eliminate any rate-parity issue if that is the cause. If not, we can bring the issue to Marketing and collaborate on strategies, such as a loyalty program, to drive more direct bookings.

3. What is the ADR by booking channels?

In [ ]:
# calculate adr by market segment for resort hotel
d1 = data_resort.groupby('market_segment').adr.agg(['mean'])
resort_adr_mkt = d1.rename(columns = {"mean" :"resort_adr"})
# reset index for resort hotel
resort_adr_mkt = resort_adr_mkt.reset_index()

# calculate adr by market segment for city hotel
d2 = data_city.groupby('market_segment').adr.agg(['mean'])
city_adr_mkt = d2.rename(columns = {"mean" :"city_adr"})
#reset index for city hotel
city_adr_mkt = city_adr_mkt.reset_index()

# merge two datasets
merge_adr_mkt = pd.merge(resort_adr_mkt, city_adr_mkt, on = "market_segment", how = "outer")

# replace NaN to zero in merge_adr_mkt dataset
merge_adr_mkt = merge_adr_mkt.fillna(0)

# round decimal for adr values
merge_adr_mkt = merge_adr_mkt.round(1)
In [ ]:
# plot adr by market segment
merge_adr_mkt.plot(figsize=(10,5), x = "market_segment", y = ["resort_adr", "city_adr"])
plt.title("ADR by market segment")
plt.ylabel("adr")

ADR by segment shows the same pattern for the resort hotel and the city hotel: reservations booked directly with the hotel drive the highest ADR, while ADR for the Online TA segment is slightly lower.

4. What is the ADR per month for each hotel?

In [ ]:
# plot adr per month for each hotel
month_order = ['January', 'February', 'March','April','May','June','July','August','September','October','November',
                                      'December']
ordered_month = data_not_canceled.sort_values("arrival_date_month")
ordered_month['arrival_date_month'] =  pd.Categorical(ordered_month['arrival_date_month'], 
                                                          categories=month_order, ordered=True)
plt.figure(figsize = (15,5))
ax = sns.lineplot(x = 'arrival_date_month', y = 'adr', sort = False, data = ordered_month, 
                  hue = 'hotel', style = 'hotel')
plt.title("ADR per month")

The chart clearly shows that July and August are the peak months for the resort hotel, while ADR is quite low in November, January and February. The Hotel Manager should capitalize on July and August to drive a higher ADR. For the low months, since we don't have occupancy data, we cannot tell whether dropping the price brings more occupancy; if not, the Hotel Manager needs to find other strategies to maximize revenue.
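To complement the line chart with exact numbers, monthly ADR per hotel can be pivoted into a small table. A minimal sketch on toy bookings (the ADR values below are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy bookings standing in for data_not_canceled (same three columns, invented values)
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "arrival_date_month": ["July", "August", "July", "August"],
    "adr": [180.0, 200.0, 120.0, 110.0],
})

# One row per month, one column per hotel, mean ADR in each cell
monthly_adr = toy.pivot_table(index="arrival_date_month", columns="hotel",
                              values="adr", aggfunc="mean").round(1)
print(monthly_adr)
```

Running the same pivot on `data_not_canceled` yields the numbers behind the seaborn line plot above.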

5. What is the customer type for each hotel?

In [ ]:
# graph number of reservation by customer type for each hotel
plt.figure(figsize = (15,6))
ax1 = sns.countplot(data = data_not_canceled, x = 'customer_type', palette = 'husl', hue = 'hotel')
plt.title("customer type")

6. How long do guests stay on average at each hotel? (staying pattern)

In [ ]:
# filter data just includes columns we need
data_stay = data_not_canceled.filter(['hotel','stays_in_weekend_nights', 'stays_in_week_nights'])
In [ ]:
# group data by hotel and compute the average of the two stay-length variables
avg_stay = data_stay.groupby('hotel').mean().round(1)
avg_stay = avg_stay.transpose()
avg_stay
In [ ]:
# plot guest staying pattern for each hotel
avg_stay.plot(kind = 'bar', colormap='Paired')
plt.xticks(rotation=0)
plt.ylabel("days")
plt.title("nights stay at hotels")

Unsurprisingly, guests stay longer at the resort hotel than at the city hotel.

7. Do guests tend to cancel their bookings if they booked earlier?

In [ ]:
sns.catplot(data = htl_data, kind = "box", x = 'is_canceled' , y = "lead_time", hue = 'hotel', 
            palette="Set3",height = 5,width = 0.8)

The boxplots show that cancelled bookings have markedly longer lead times: guests who book far in advance are more likely to cancel.

8. Which market segments have higher cancellation rates?

In [ ]:
# market segment with cancellation
mkt_cxl = htl_data.groupby(['market_segment', 'is_canceled']).size()
mkt_cxl.unstack().plot(kind = "bar", stacked = True, colormap = "bwr")

Guests who booked directly or through corporate channels cancel less often. It is surprising that more than half of group bookings are cancelled. What are the cancellation policy and attrition terms in the group contracts? Hotel Revenue Managers may need to revisit them with the Sales Managers to avoid such a high cancellation rate for group bookings.
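The stacked bars show counts; the cancellation *rate* per segment is simply the mean of the 0/1 `is_canceled` flag within each group. A minimal sketch on invented rows (the segment labels mirror the dataset, the bookings do not):

```python
import pandas as pd

# Toy bookings standing in for htl_data (two columns, invented rows)
toy = pd.DataFrame({
    "market_segment": ["Groups", "Groups", "Direct", "Direct", "Online TA", "Online TA"],
    "is_canceled":    [1, 1, 0, 0, 1, 0],
})

# mean of the binary target per segment = cancellation rate, highest first
cxl_rate = toy.groupby("market_segment")["is_canceled"].mean().sort_values(ascending=False)
print(cxl_rate)
```

The same one-liner on `htl_data` ranks the real segments by cancellation rate, making the "more than half of group bookings" observation directly verifiable.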

9. Do guests who paid no deposit tend to cancel?

In [ ]:
# deposit type with cancellation
sns.catplot(data = htl_data, kind = "count", x = "deposit_type", col = "is_canceled", hue = "hotel", palette="hls",
            height = 6, aspect = .8)

Guests who booked without a deposit cancel more often at both the resort and the city hotel. It is surprising that most reservations are nonetheless honored even though no deposit was paid. Based on this information, plus thorough competitor research, the Hotel Manager can reconsider the guarantee policy to improve sell-out efficiency.
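The per-deposit-type shares behind this chart can be computed with a row-normalized crosstab. A sketch on invented rows, assuming the same `deposit_type` / `is_canceled` columns as the dataset:

```python
import pandas as pd

# Toy bookings standing in for htl_data (invented rows)
toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "No Deposit", "Non Refund"],
    "is_canceled":  [0, 0, 1, 1],
})

# normalize="index" turns each deposit-type row into shares of kept (0)
# vs. cancelled (1) bookings
rates = pd.crosstab(toy["deposit_type"], toy["is_canceled"], normalize="index")
print(rates)
```

On the real data this makes the counterintuitive pattern explicit: the cancellation share within each deposit type, not just the raw counts.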

10. Do guests tend to cancel if they cancelled before?

In [ ]:
pc = htl_data.groupby('is_canceled').previous_cancellations.agg(['mean'])
pc
In [ ]:
pc.plot(kind = "bar", colormap='Paired', figsize = (7,7))
plt.ylabel("avg_previous_cxl")
plt.xticks(rotation = 0)

The chart shows that the average number of previous cancellations differs greatly between cancelled and non-cancelled reservations. Therefore, I assume previous cancellations are a good indicator for predicting cancellation. We can test this assumption in the feature-correlation section below.
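One way to quantify this association ahead of the correlation section is the point-biserial correlation between the binary target and the numeric feature (`scipy.stats` is already imported above). A sketch on toy values invented for illustration:

```python
import pandas as pd
from scipy import stats

# Toy stand-ins for htl_data['is_canceled'] and htl_data['previous_cancellations']
is_canceled = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
prev_cxl    = pd.Series([0, 0, 0, 1, 0, 2, 3, 5])

# Point-biserial correlation: association between a binary variable and a
# numeric one (numerically identical to Pearson r on the 0/1 coding)
r, p = stats.pointbiserialr(is_canceled, prev_cxl)
print(round(r, 3), round(p, 3))
```

On the real columns, `stats.pointbiserialr(htl_data['is_canceled'], htl_data['previous_cancellations'])` gives the same number that the dummy-variable correlation ranking below will report for this feature.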

Evaluate Feature Importance

In [ ]:
# check which missing values remain after cleaning
htl_data.isnull().sum()  # only the agent column still contains missing values

I will replace the missing agent values with 999, since the agent column holds numeric IDs stored as floats and 999 is outside the range of existing IDs.

In [ ]:
# replace missing value in agent with 999

htl_data['agent'] = htl_data['agent'].replace(np.nan, 999.0)
In [ ]:
# check all the elements within each feature
for column in htl_data:
    unique_values = np.unique(htl_data[column])
    uq_len = len(unique_values)
    if uq_len < 15:
        print ("The number of unique values for feature {} is {} -- {}".format(column, uq_len, unique_values))
    else:
        print ("The number of unique values for feature {} is {}".format(column, uq_len))

After investigating each column, I will drop the assigned_room_type column, since rooms are usually assigned at check-in, but cancellation occurs before check-in. Likewise, reservation_status directly encodes the cancellation outcome and would leak the target into the prediction, so I will remove this column as well. Next, we will convert all object variables to dummy variables.

In [ ]:
# delete two columns, assigned_room_type and reservation_status
del htl_data['assigned_room_type']
del htl_data['reservation_status']
In [ ]:
# replace arrival_date_month column to numeric 

replace_month = htl_data.copy()
replace_month['arrival_date_month'] = replace_month['arrival_date_month'].replace(
    {'January' : 1, 'February' : 2, 'March' : 3, 'April' : 4, 'May' : 5, 'June' : 6,
     'July' : 7, 'August' : 8, 'September' : 9, 'October' : 10, 'November' : 11,
     'December' : 12})
In [ ]:
# get a list of column name whose data type is object
replace_month.dtypes
In [ ]:
replace_month.dtypes[replace_month.dtypes == object].index
In [ ]:
# choose all columns name with object data type
feature_object = list(replace_month.dtypes[replace_month.dtypes == object].index)
feature_object
In [ ]:
# changing all categorical variables into numeric representation
new_data = pd.get_dummies(replace_month, columns = feature_object)
new_data.info()
In [ ]:
cancel_cor = new_data.corr()['is_canceled']
cor_array = cancel_cor.abs().sort_values(ascending=False)
In [ ]:
cor_array[1:30]
In [ ]:
cor_array[200:232]

It is interesting to find that the top five features are deposit_type_Non Refund, deposit_type_No Deposit, country_PRT, lead_time and total_of_special_requests. Previous cancellations are not even in the top ten. The Hotel Manager can focus on the most important features and develop strategic plans to reduce cancellations.
Based on the correlations, we notice that many country dummy variables have very low correlation with the target. To reduce computation time later on, I will replace the low-correlation country values with a single placeholder category 'AAA', keeping only the countries most guests come from.

In [ ]:
# use the dataset we created at the front 
resort_country_count['pct'] = resort_country_count.hotel/resort_country_count.hotel.sum()*100
city_country_count['pct'] = city_country_count.hotel/city_country_count.hotel.sum()*100
In [ ]:
print(resort_country_count.head(11)['pct'].sum())
print(city_country_count.head(17)['pct'].sum())
In [ ]:
# extract the name of the countries 
r = resort_country_count.head(11)
l_r = list(r.country)
c = city_country_count.head(17)
l_c = list(c.country)
print (l_r)
print(l_c)
In [ ]:
# this is the list of countries we would like to keep
merge_country = list (set(l_r + l_c))
print(merge_country)
In [ ]:
# replace country values with 'AAA' for all but the most frequent origin countries
replace_month['country'] = replace_month['country'].where(replace_month['country'].isin(merge_country), other = 'AAA')
In [ ]:
replace_month['country'].unique()
In [ ]:
# Let's create dummy variables one more time.

new_data = pd.get_dummies(replace_month, columns = feature_object)
new_data.shape

We successfully reduced the dataset from 232 columns to just 73! Next comes the fun part: modelling!

Modelling

In [ ]:
# splitting the dataset into x and y
x = new_data.drop(['is_canceled'],axis = 1).values
y = new_data['is_canceled'].values
In [ ]:
# splitting the dataset into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, test_size = 0.2, random_state = 42)
In [ ]:
# Run Random Forest Model
model_r = RandomForestClassifier()
model_r.fit(x_train, y_train)
In [ ]:
model_r.score(x_test,y_test)
In [ ]:
# Run Decision Tree Model
model_d = DecisionTreeClassifier()
model_d.fit(x_train, y_train)
In [ ]:
model_d.score(x_test, y_test)

Standardize the variables for logistic regression

In [ ]:
# select the input excluding is_canceled column
unscaled_input = new_data.iloc[:, 1:73]
In [ ]:
# Standardize the data (strictly speaking, the scaler should be fit on the
# training split only, so that no test-set information leaks into the scaling)
new_data_scaler = StandardScaler()
new_data_scaler.fit(unscaled_input)
In [ ]:
scaled_input = new_data_scaler.transform(unscaled_input)
scaled_input
In [ ]:
# split the standardized data into train and test; use new variable names
# so the unscaled split used by the tree models above is not overwritten

xs_train, xs_test, ys_train, ys_test = train_test_split(scaled_input, y, train_size = 0.8, 
                                                        test_size = 0.2, random_state = 42)
In [ ]:
model_l = LogisticRegression(solver = "liblinear")
model_l.fit(xs_train, ys_train)
model_l.score(xs_test, ys_test)

Prediction accuracy for Random Forest is much better than that of the other two models. Next, we will tune the Random Forest model to see if we can get a better result.

In [ ]:
# increase the number of trees to 400
model_r = RandomForestClassifier(n_estimators = 400)
model_r.fit(x_train, y_train)
model_r.score(x_test,y_test)
In [ ]:
# use out-of-bag samples to estimate
model_r = RandomForestClassifier(n_estimators = 400,oob_score = True )
model_r.fit(x_train, y_train)
model_r.score(x_test,y_test)
In [ ]:
model_r = RandomForestClassifier(n_estimators = 400,n_jobs = -1 )
model_r.fit(x_train, y_train)
model_r.score(x_test,y_test)

Tuning the parameters does not increase model accuracy much. We will use our default random forest model. Finally, we will draw a ROC curve to better visualize the model.
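Manual tuning like the cells above can also be replaced by an exhaustive grid search with cross-validation. A minimal sketch on synthetic data (the grid values are illustrative, not tuned for this dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the hotel feature matrix
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Try every combination in the grid, scoring each by 3-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

On the real `x_train`/`y_train` the same pattern would search the candidate `n_estimators` values systematically instead of one cell at a time.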

In [ ]:
# use predicted probabilities so the ROC curve has more than one threshold point
y_predicted = model_r.predict_proba(x_test)[:, 1]
In [ ]:
# ROC & AUC
from sklearn.metrics import roc_curve, auc
rf_fpr,rf_tpr, threshold = roc_curve(y_test, y_predicted)
auc_rf = auc(rf_fpr, rf_tpr)
In [ ]:
# plot the ROC curve
plt.figure(figsize = (5, 5))
plt.plot(rf_fpr, rf_tpr, marker = '.', label = 'Random Forest (auc = %0.3f)' % auc_rf)
plt.xlabel('False Positive Rate --->')
plt.ylabel('True Positive Rate --->')
plt.legend()
plt.show()
In [ ]:
# Let's do a Decision Tree ROC to compare
y_pred = model_d.predict_proba(x_test)[:, 1]
df_fpr, df_tpr, threshold = roc_curve(y_test, y_pred)
auc_dt = auc(df_fpr, df_tpr)
In [ ]:
# plot the ROC curve
plt.figure(figsize = (5, 5))
plt.plot(df_fpr, df_tpr, marker = '.', label = 'Decision Tree (auc = %0.3f)' % auc_dt)
plt.xlabel('False Positive Rate --->')
plt.ylabel('True Positive Rate --->')
plt.legend()
plt.show()

The two ROC charts clearly show that random forest is a better model than the decision tree, which further confirms the accuracy comparison above.
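A single train/test split can be noisy; k-fold cross-validation would make the random forest vs. decision tree comparison more robust. A minimal sketch on synthetic data standing in for the hotel feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the hotel feature matrix and cancellation target
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# 5-fold CV: each model is trained and scored on five different splits
rf_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
dt_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(rf_scores.mean().round(3), dt_scores.mean().round(3))
```

Applied to the real `x` and `y`, comparing mean CV scores (and their spread) would confirm the ranking without depending on one particular 80/20 split.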